library(scater)
library(data.table)
library(cowplot)
library(DT)
library(knitr)
opts_chunk$set(cache = FALSE, warning = FALSE)The mouse cortex data from Zeisel et al, 2015 have been made available as UMI (unique molecular identifier) counts at the Linnarsson Lab website.
Here we download the data directly from the Linnarrsson Lab page and produce an SCESet object, which is the data structure used in the scater package.
First load the data for metadata for mRNA genes:
## download mRNA data
mrna_meta <- fread("http://linnarssonlab.org/blobs/cortex/expression_mRNA_17-Aug-2014.txt",
nrows = 10, header = FALSE)
mrna_meta <- as.data.frame(t(mrna_meta), stringsAsFactors = FALSE)
colnames(mrna_meta) <- mrna_meta[2,]
mrna_meta <- mrna_meta[-c(1,2),]
rownames(mrna_meta) <- mrna_meta$cell_id
datatable(mrna_meta[1:6,])Now load the count data for the mRNA genes:
mrna <- fread("http://linnarssonlab.org/blobs/cortex/expression_mRNA_17-Aug-2014.txt",
skip = 10)##
Read 50.1% of 19972 rows
Read 19972 rows and 3007 (of 3007) columns from 0.113 GB file in 00:00:04
mrna <- as.data.frame(mrna)
rownames(mrna) <- mrna$V1
mrna_gene_cluster <- mrna$V2
mrna <- mrna[, -c(1,2)]
colnames(mrna) <- mrna_meta$cell_id
datatable(mrna[1:20, 1:6])Next load the metadata for mitochondrial genes:
## download data for mitochondrial genes
mtgenes_meta <- fread("http://linnarssonlab.org/blobs/cortex/expression_mito_17-Aug-2014.txt",
nrows = 10, header = FALSE)
mtgenes_meta <- as.data.frame(t(mtgenes_meta), stringsAsFactors = FALSE)
colnames(mtgenes_meta) <- mtgenes_meta[2,]
mtgenes_meta <- mtgenes_meta[-c(1,2),]
rownames(mtgenes_meta) <- mtgenes_meta$cell_id
datatable(mtgenes_meta[1:6,])Followed by the count data for mitochondrial genes:
mtgenes <- fread("http://linnarssonlab.org/blobs/cortex/expression_mito_17-Aug-2014.txt",
skip = 10)
mtgenes <- as.data.frame(mtgenes)
rownames(mtgenes) <- mtgenes$V1
mtgenes <- mtgenes[, -c(1,2)]
colnames(mtgenes) <- mtgenes_meta$cell_id
datatable(mtgenes[1:10, 1:6])Finally load the metadata for ERCC spike-ins:
## download data for ERCC spike-ins
ercc_meta <- fread("http://linnarssonlab.org/blobs/cortex/expression_spikes_17-Aug-2014.txt",
nrows = 10, header = FALSE)
ercc_meta <- as.data.frame(t(ercc_meta), stringsAsFactors = FALSE)
colnames(ercc_meta) <- ercc_meta[2,]
ercc_meta <- ercc_meta[-c(1,2),]
rownames(ercc_meta) <- ercc_meta$cell_id
datatable(ercc_meta[1:6,])and then the count data for the ERCC spike-ins:
ercc <- fread("http://linnarssonlab.org/blobs/cortex/expression_spikes_17-Aug-2014.txt",
skip = 10)
ercc <- as.data.frame(ercc)
rownames(ercc) <- ercc$V1
ercc <- ercc[, -c(1,2)]
colnames(ercc) <- ercc_meta$cell_id
datatable(ercc[1:10, 1:6])Now, after checking that the cell IDs match up in the different data frames we have loaded, combine data and form into an SCESet object:
identical(mtgenes_meta[rownames(mrna_meta),], mrna_meta)
identical(ercc_meta[rownames(mrna_meta),], mrna_meta)
## cell metadata matches as long as cells are matched
mtgenes <- mtgenes[, rownames(mrna_meta)]
identical(colnames(mtgenes), colnames(mrna))
ercc <- ercc[, rownames(mrna_meta)]
identical(colnames(ercc), colnames(mrna))## combine expression values
counts <- rbind(mrna, mtgenes, ercc)
dim(counts)## [1] 20063 3005
pd <- new("AnnotatedDataFrame", mrna_meta)
fd <- data.frame(gene_cluster = c(mrna_gene_cluster,
rep(NA, (nrow(ercc) + nrow(mtgenes)))))
rownames(fd) <- rownames(counts)
fd <- new("AnnotatedDataFrame", fd)
sce_zeisel_raw <- newSCESet(countData = counts, phenoData = pd, featureData = fd)## Generating log2(counts-per-million + offset) from counts to use as
## expression data, with prior.count = logExprsOffset. See edgeR::cpm().
## Note that counts-per-million are not recommended for most analyses.
## Consider using transcripts-per-million instead. See ?calculateTPM.
## Defining 'is_exprs' using count data and a lower count threshold of 0
sce_zeisel_raw## SCESet (storageMode: environment)
## assayData: 20063 features, 3005 samples
## element names: counts, cpm, exprs, is_exprs
## protocolData: none
## phenoData
## sampleNames: 1772071015_C02 1772071017_G12 ... 1772058148_F03
## (3005 total)
## varLabels: tissue group # ... level2class (10 total)
## varMetadata: labelDescription
## featureData
## featureNames: Tspan12 Tshz1 ... ERCC-00171 (20063 total)
## fvarLabels: gene_cluster
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
This data were produced using UMIs, so cpm (counts-per-million) should be the correct units for analysis (transcript length becomes irrelevant as we are counting molecules directly).
Calculate QC metrics with scater’s calculateQCMetrics function. It is helpful to define “feature controls”, that is features (here genes) that can be used for QC purposes. Typically, it is helpful to define ERCC spike-ins and mitochondrial genes as feature controls, which is exactly what we do here:
ercc_genes <- grep("ERCC", featureNames(sce_zeisel_raw))
mt_genes <- grep("mt-", featureNames(sce_zeisel_raw))
sce_zeisel_raw <- calculateQCMetrics(
sce_zeisel_raw, feature_controls = list(ERCC = ercc_genes, mt = mt_genes))For this data set the original authors classified the cells into the broad cell classes shown in the output below.
table(sce_zeisel_raw$level1class)##
## astrocytes_ependymal endothelial-mural interneurons
## 224 235 290
## microglia oligodendrocytes pyramidal CA1
## 98 820 939
## pyramidal SS
## 399
We will explore the cell types throughout the QC procedures described subsequently.
Scater allows you to set minimum QC thresholds for a gene to be considered sufficiently expressed in your downstream analysis. Here, using the inbuilt is_exprs() function, we enforce that a gene must have least one count in at least 98 cells (as this is the smallest group of cells as identified by Zeisel et al). The rationale behind this is that we would want to keep a gene if it was expressed in just one group of cells, but we don’t want genes with very sparse expression overall.
keep_gene <- rowSums(is_exprs(sce_zeisel_raw)) >= 98
fData(sce_zeisel_raw)$use <- keep_geneWe retain 12386 genes for the analysis and drop 7677 lowly-expressed genes from the analysis.
It can be useful to plot gene expression frequency versus mean expression level to assess the effects of technical dropout in the dataset. We fit a non-linear least squares curve for the relationship between expression frequency and mean expression and use this to define the number of genes above high technical dropout and the numbers of genes that are expressed (here defined as at least 1 count) in at least 50% and at least 25% of cells. A subset of genes to be treated as feature controls can be specified, otherwise any feature controls previously defined are used.
plotQC(sce_zeisel_raw, type = "exprs") The plotQC() function provides several useful QC plots, such as the example below that considers the the number of reads consumed by the top 50 expressed genes. Aside from spike-ins, these are typically mitochondrial and housekeeping genes. Here, as with most single-cell experiments, a large proportion of reads are being are taken up by uninteresting biology.
plotQC( sce_zeisel_raw[fData(sce_zeisel_raw)$use, ],
type = "highest-expression", col_by_variable = "level1class" )As well as the expected ERCC and mitochondrial genes among the most expressed, we see Actb, involved in cell motility, structure and integrity.
A useful first step is flagging/failing poorly performing cells. This can be done from the sample meta-data using the automated QC metrics generated above,
any additional sequencing metrics from sequencing aligner/mapping software,
and additional cell phenotypes such as from imaging. For the sake of demonstration, here we focus on four metrics. Others you may want to consider are % reads mapped to mitochondrial genes, library PCR duplication rate,
and mean sequencing bias per cell.
plotPhenoData() can be used to explore specific sample meta-data values. For example, in the plots below we can see how different QC metrics distinguish the cells.
p1 <- plotPhenoData(sce_zeisel_raw, aes_string(x = "log10_total_counts",
y = "pct_counts_feature_controls_mt",
colour = "level1class"))
p2 <- plotPhenoData(sce_zeisel_raw, aes_string(x = "log10_total_counts",
y = "pct_counts_feature_controls_ERCC",
colour = "level1class"))
p3 <- plotPhenoData(sce_zeisel_raw, aes_string(x = "total_features",
y = "pct_counts_feature_controls_mt",
colour = "level1class"))
p4 <- plotPhenoData(sce_zeisel_raw, aes_string(x = "total_features",
y = "pct_counts_feature_controls_ERCC",
colour = "level1class"))
plot_grid(p1, p2, p3, p4, labels = letters[1:4], nrow = 2)plotPhenoData(sce_zeisel_raw, aes_string(x = "total_features",
y = "pct_counts_feature_controls",
colour = "filter_on_pct_counts_feature_controls",
shape_by = "filter_on_total_counts")) +
geom_vline(xintercept = 1000, linetype = 2)Use QC metrics to select a subset of cells for use.
sce_zeisel_raw$use <- (sce_zeisel_raw$total_features > 1000 & #sufficient features
sce_zeisel_raw$total_counts > 5000 & # sufficient molucules counted
!sce_zeisel_raw$filter_on_pct_counts_feature_controls & # sufficient endogenous RNA
!sce_zeisel_raw$filter_on_total_features & # remove cells with unusual numbers of genes
!sce_zeisel_raw$is_cell_control # controls shouldn't be used in downstream analysis
)
table(sce_zeisel_raw$use)##
## FALSE TRUE
## 27 2978
This would lead us to drop a further 27 cells from this dataset.
Box plots aren’t particularly useful for visualising sparse data, so plot() applied to an SCESet object helps visualise all cells as a cumulative proportion of reads per cell. You can see from the plot below that the two failed cells have curves that look more like the blank controls.
Cumulative expression plots:
plot(sce_zeisel_raw, block1 = "level1class", colour_by = "level2class",
exprs_values = "counts")We observe that certain cell types (e.g. oligodendrocytes and microglia) have a larger proportion of cells with their libraries accounted for by a small number of cells than to pyramidal cells.
plot(sce_zeisel_raw, block1 = "level1class", colour_by = "use",
exprs_values = "counts")The plots above show that some (but certainly not all) of the cells we have opted not to use have a large proportion of their library accounted for by a handful of very highly-expressed genes.
Scater allows users total flexibility to run their favourite dimension reduction methods, as decribed here and in the supporting help files. Here we use plotPCA() to further explore the cells. The t-SNE plot works particularly nicely for this dataset to separate the different cell types as identified by Zeisel et al.
Let’s take a look at a PCA plot:
sce_zeisel_raw <- plotPCA(sce_zeisel_raw, colour_by = "tissue",
return_SCESet = TRUE)plotReducedDim(sce_zeisel_raw, colour_by = "level1class")Take a look using a diffusion map (better suited to cells distributed along a process:
plotDiffusionMap(sce_zeisel_raw, colour_by = "level1class")And finally look at a t-SNE plot:
plotTSNE(sce_zeisel_raw, colour_by = "level1class", rand_seed = 20160128)The t-SNE gives nice, tight cell-type clusters.
plotPCA(sce_zeisel_raw, colour_by = "use", shape_by = "level1class" ) Another option available in
scater is to conduct PCA on a set of QC metrics. The advantage of doing this is that the QC metrics focus on technical aspects of the libraries that are likely to distinguish problematics cells. Automatic outlier detection on PCA plots using QC metrics is available to help identify potentially problematic cells.
By default, the following metrics are used for PCA-based outlier detection:
pct_counts_top_100_featurestotal_featurespct_counts_feature_controlsn_detected_feature_controlslog10_counts_endogenous_featureslog10_counts_feature_controlsA particular set of variables to be used can be specified with the selected_variables argument as shown in the example below.
## PCA on the phenoData cannot handle missing values
## for the exercise here we thus set cdna_recovered_in_ng_per_ul to 100 where
## there are NA values actually (creating a new dummy variable)
sce_zeisel_raw <- plotPCA(sce_zeisel_raw, size_by = "total_features",
shape_by = "filter_on_total_features",
pca_data_input = "pdata", detect_outliers = TRUE,
return_SCESet = TRUE)## The following cells/samples are detected as outliers:
##
## Variables with highest loadings for PC1 and PC2:
##
## PC1 PC2
## --------------------------------- ----------- ----------
## pct_counts_top_100_features 0.5125750 0.0770133
## pct_counts_feature_controls 0.5060459 0.1395898
## log10_counts_feature_controls 0.1337814 0.6556969
## n_detected_feature_controls 0.0683787 0.6457144
## total_features -0.4766248 0.2540722
## log10_counts_endogenous_features -0.4810821 0.2512876
This dataset has already been carefully QC’d by the original authors and this PCA plot confirms that, with the PCA on QC metrics detecting no further outliers.
With scater, any specific set of features based on prior knowledge can be used for PCA, t-SNE or diffusion maps. A feature set to use can be defined by supplying the feature_set argument to plotPCA, plotTSNE or plotDiffusionMap. This allows, for example, using only housekeeping features or control features or cell cycle genes to produce reduced-dimension plots. The plots below use only the spike-in genes defined as such earlier.
p1 <- plotPCA(sce_zeisel_raw, feature_set = fData(sce_zeisel_raw)$is_feature_control,
colour_by = "use", shape_by = "tissue",
size_by = "outlier" ) + ggtitle("PCA")
p2 <- plotDiffusionMap(sce_zeisel_raw,
feature_set = fData(sce_zeisel_raw)$is_feature_control,
colour_by = "use", shape_by = "tissue",
size_by = "outlier" ) + ggtitle("Diffusion map")
multiplot(p1, p2, cols = 2)In addition to outliers, we prefer to use cells that have a good coverage of detected genes. After gene filtering, we demand that cells have detectable expression for at least 10% of retained genes.
keep_cell <- (colSums(is_exprs(sce_zeisel_raw[fData(sce_zeisel_raw)$use,])) >
0.1 * sum(fData(sce_zeisel_raw)$use))
sce_zeisel_raw$use <- (sce_zeisel_raw$use & keep_cell)After this procedure we retain 2931 cells and drop 74 cells.
Once you have filtered cells and genes, a next step is to explore technical drivers of variability in the data to inform data normalisation before downstream analysis.
Experimental design is a critical, but neglected, aspect of sc-RNA-seq studies. To the best of our knowledge, methods like those described in this section for exploring experimental and QC variables and the experimental design, do not feature in any scRNA-seq software packages apart from scater. There are a very large number of potential confounders, artifacts and biases in sc-RNA-seq studies. Exploring the effects of such explanatory variables (both those recorded during the experiment and computed QC metrics) is crucial for appropriate modeling of the data. The scater package provides a set of methods specifically for quality control of experimental and explanatory variables, which will be demonstrated briefly here.
Using the plotPCA() function we can see that principal component one appears to be driven by differences in number of genes detected (“total features”). Differences in number of detected genes is a common driver of cell clustering and can be result of biology (e.g. different cell types, cell cycle). Indeed, in the PCA here we see that the endothelial-mural and astrocyte-ependymal cells have typically many fewer features than the other cell types.
However, number of genes detected often has a strong technical component due to variably recovered RNA, reverse transcription, or library amplification. Its effect can also be notably non-linear, affecting low expressed and high expressed genes differently. The plotQC() function can be used to explore the the marginal % variance explained (per gene) of the various technical factors. In the second plot we can see that it’s not unusual for gene coverage to explain more than 10% of the expression variance of a gene.
# we can easily subset to only plot genes and cells of interest
plotPCA( sce_zeisel_raw[fData(sce_zeisel_raw)$use, sce_zeisel_raw$use],
colour_by = "level1class", shape_by = "tissue", size_by = "total_features" )One can also easily produce plots to identify principal components that correlate with experimental and QC variables of interest. The function plotQC with the option type = "find-pcs" ranks the principal components in decreasing order of \(R^2\) from a linear model regressing PC value against the variable of interest. The default behaviour is to show the relationships between the variable of interest and the six principal components with the strongest relationship to the variable (as measured by \(R^2\)). This works both for continous and categorical variables. This type of plot can indicate which explanatory variables may be driving differences between cells as detected by PCA and highlight which PCs are associated with the variable. The “total features” variable shows very strong correlation with both principal components 1 and 2.
p1 <- plotQC(sce_zeisel_raw[fData(sce_zeisel_raw)$use, sce_zeisel_raw$use],
type = "find-pcs", variable = "total_features" )
p2 <- plotQC(sce_zeisel_raw[fData(sce_zeisel_raw)$use, sce_zeisel_raw$use],
type = "find-pcs", variable = "level1class" ) +
theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1))
plot_grid(p1, p2, ncol = 2, labels = letters[1:2])Very clearly, the first principal component is affected strongly by the total number of features detected in a cell.
The relative importance of different explanatory variables can be explored with some of the plotQC function options. Supplying the type = "expl" argument to plotQC computes the marginal \(R^2\) for each variable in the when fitting a linear model regressing expression values for each gene against just that variable, and displays a density plot of the gene-wise marginal \(R^2\) values for the variables. The default approach looks at all variables in the slot of the object and plots the top nvars_to_plot variables (default is 10).
Alternatively, one can choose a subset of variables to plot in this manner, which we do here. The density curves for marginal \(R^2\) show the relative importance of different variables for explaining variance in expression between cells. In the plot we can see that it’s not unusual for gene coverage to explain more than 10% of the expression variance of a gene.
plotQC(sce_zeisel_raw[fData(sce_zeisel_raw)$use, sce_zeisel_raw$use],
type = "explanatory-variables",
variables = c("pct_counts_top_100_features", "total_features",
"pct_counts_feature_controls", "level1class",
"n_detected_feature_controls",
"log10_counts_endogenous_features",
"log10_counts_feature_controls", "tissue") )This analysis indicates that total number of features detected and the sequencing depth (number of counts) for endogenous genes, in particular, have substantial explanatory power for many genes, so these variables are good candidates for conditioning out in a normalisation step, or including in downstream statistical models. The number of detected feature controls (spike-in genes) does not appear to be an important explanatory variable.
The plotQC() function can also be used to produce a pairs plot of explanatory variables (with the same calls as above, but with method = "pairs"). The plot below shows this use case for looking at the % counts from the top 100 most-expressed genes, the total number of expressed genes, the % of counts from feature controls, the number of detected feature controls, the number of counts (on the log-10 scale) from endogenous features, the number of counts (log-10 scale) from feature controls and sample type. The explanatory variables are ordered by the median \(R^2\) of the variable across all genes, and this value is reported on the plot. This type of plot is useful for finding correlations between experimental and QC variables with substantial explanatory power.
plotQC(sce_zeisel_raw[fData(sce_zeisel_raw)$use, sce_zeisel_raw$use],
type = "expl", method = "pairs",
variables = c("pct_counts_top_100_features", "total_features",
"pct_counts_feature_controls",
"n_detected_feature_controls",
"log10_counts_endogenous_features",
"log10_counts_feature_controls", "level1class"),
theme_size = 6)
The plot above shows that the variables level1class (cell type), total_features, log10_counts_endogenous_features, pct_counts_top_100_features and pct_counts_feature_controls are all potentially important explanatory variables, while tissue is less important and n_detected_feature_controls and log10_counts_feature_controls do not look important at all.
After important explanatory variables have been identified with the tools shown above, their effects can be accounted for in subsequent statistical models, or they can be conditioned out using normaliseExprs(), if so desired. If a design matrix incorporating a selection of explanatory variables is supplied as an argument to normaliseExprs, then normalised expression values returned for each feature will be the residuals from a linear model fitted with the design matrix, after any size-factor normalisation has been applied to the expression data.
Normalising single-cell RNA-seq data is a topic in its infancy, but many of the basic principles still apply. How much you choose to initially correct for technical factors depends on your question of interest and the ease with which they can be accounted for in downstream models.
In scater it is easy to perform size-factor normalisation using the “TMM” approach (the default in the edgeR package), the “RLE” approach (default in the DESeq package), an “upper-quartile approach” (as proposed by Bullard et al (2010)) or simple scaling by total counts.
The code below demonstrates how to carry out size-factor normalisation on a subsetted SCESet object, normalising either to ERCC spike-in genes or to endogenous genes. Under certain circumstances either may be appropriate. Careful thought should be given to applying scale-factor normalisation as underlying assumptions may not be appropriate for all single-cell datasets.
For instance, the RLE size-factor method does not work for this data set, as the distribution of expression values for each cell has too many zeroes for this method to handle. Similarly, the Bullard et al method only works here for quantile values greater than 90%, because the quantile values less than 90% give a value of zero for numerous cells - when this is the case, the normalisation method fails. If inappropriate normalisation factors are computed, a warning is provided and normalisation factors are set to 1 (the same as applying method="none").
## subset to form a QC'd version of the data
sce_zeisel_qc <- sce_zeisel_raw[fData(sce_zeisel_raw)$use |
fData(sce_zeisel_raw)$is_feature_control,
sce_zeisel_raw$use]
endog_genes <- !fData(sce_zeisel_qc)$is_feature_control
sce_zeisel_qc <- normaliseExprs(sce_zeisel_qc, method = "upperquartile",
feature_set = endog_genes, p = 0.99)
summary(sce_zeisel_qc$norm_factors)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5754 0.9116 0.9806 1.0100 1.0740 1.8830
Here normalisation to the 99th percentile of the libraries is shown - this avoids using extremely low values, but also avoids the normalisation being unduly influenced by a small number of very highly expressed genes.
We can investigate the effects of normalisation easily with PCA plots, here using only endogenous gene expression values to produce the plots.
## subset again so that only endogenous genes are used
plt_pca_prenorm <- plotPCA(
sce_zeisel_qc[fData(sce_zeisel_qc)$use,], exprs_values = "exprs",
size_by = "total_features", colour_by = "level1class", shape_by = "tissue") +
ggtitle("PCA - no normalisation") +
theme(legend.position = "bottom")
plt_pca_endog_norm <- plotPCA(
sce_zeisel_qc[fData(sce_zeisel_qc)$use,], exprs_values = "norm_exprs",
size_by = "total_features", colour_by = "level1class", shape_by = "tissue") +
ggtitle("PCA - endogenous size-factor normalisation") +
theme(legend.position = "bottom")
multiplot(plt_pca_prenorm, plt_pca_endog_norm, cols = 2)Here, where there are a number of distrinct cell types, we see that size-factor normalisation has little effect on the large-scale structure in the dataset as determined by PCA.
However, in the t-SNE plots below the overall structure is very similar in both cases, but we see that the normalised data brings combines the two clusters of oligodendrocytes from the un-normalised plot, and that the microglia and endothelial-mural cells are better differentiated with the normalised data.
## subset again so that only endogenous genes are used
plt_tsne_prenorm <- plotTSNE(
sce_zeisel_qc[fData(sce_zeisel_qc)$use,], exprs_values = "exprs",
size_by = "total_features", colour_by = "level1class", shape_by = "tissue",
random_seed = 20160203) +
ggtitle("t-SNE - no normalisation") +
theme(legend.position = "bottom")
plt_tsne_endog_norm <- plotTSNE(
sce_zeisel_qc[fData(sce_zeisel_qc)$use,], exprs_values = "norm_exprs",
size_by = "total_features", colour_by = "level1class", shape_by = "tissue",
random_seed = 20160203) +
ggtitle("t-SNE - endogenous size-factor normalisation") +
theme(legend.position = "bottom")
multiplot(plt_tsne_prenorm, plt_tsne_endog_norm, cols = 2)“Customised” normalisation approaches can be easily incorporated into the scater workflow, because it is very easy to add data to an SCESet obejct with the set_exprs function. In a subsequent scater tutorial we provide an example in which we normalise and standardise counts conditioned on expression level. One advantage of this approach is that a biologically ‘noisy’ gene is naturally defined as one with greater dispersion than other genes at a similar expression level. In the normalised data these are genes with a mean absolute deviation greater than 1.
Other sophisticated normalisation approaches can be attempted, such as those implemented in the BASiCS package. Alternative normalisation approaches can easily be incorporated into a scater workflow.
Thus, after convenient pre-processing, QC and normalisation with , the data are well organised (with feature and cell metadata and many data transformations), clean and tidy, and are ready for further statistical modeling and analysis.
Expression data at a single-cell resolution not only allows testing for differential expression, but exploring how this is dependent on sub-types of cells and/or how genes coexpress with each other across cells. As with normalisation, single-cell analysis methodology is an area in its infancy and deserving of discussion that is beyond this case study. Here we simply demonstrate how QC’ed and normalised data contained within an SCESet object allows for easy downstream interrogation. Let’s assume we are interested in defining differential expression as change in expression frequency. This can be done with a standard generalised linear model and the qvalue package to control false discovery rates.
Here, we simply test whether the expression frequency of each gene for endothelial-mural cells is different from that of astrocytes-ependymal cells (which is the default baseline in this model).
library("qvalue")
celltype <- sce_zeisel_qc$level1class
de <- data.frame(t(apply(1 * is_exprs(sce_zeisel_qc), 1, # test change in expression frequency
function(y) coef(
summary(glm(y ~ celltype, family = "binomial")))[2, c(1, 4)])),
check.names = FALSE )
de$qvalue <- qvalue(de[, "Pr(>|z|)"], fdr.level = 0.05)$qvalues # control for false disovery rate
de <- de[order(de$qvalue, decreasing = FALSE ), ] # order by global statistical evidence
datatable(de[1:1000,])In scater the plotExpression function enables the convenient visualisation of expression values for a set of features. Here, the expression values for the six most DE genes for expression frequency between cell types are shown. The units for expression in the plot can be defined with the exprs_values argument (the expression values must exist with the provided name in the assayData slot of the SCESet object; if not the default exprs values will be used, with a warning). As with other plots in scater we can use phenotype data variables to define the colour and shape of the points.
plotExpression(sce_zeisel_qc, rownames(de)[1:6], x = "level1class", ncol = 3,
colour_by = "level1class") +
theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1))plotExpression(sce_zeisel_qc, rownames(de)[1:6], x = "level1class", ncol = 3,
colour_by = "level1class", exprs_values = "norm_counts", log = TRUE) +
theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1))plotExpression(sce_zeisel_qc, tail(rownames(de), n = 6), exprs_values = "norm_exprs",
x = "level1class", colour_by = "level1class",
shape_by = "tissue", ncol = 3) +
theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1))
As one would hope, the genes with the least evidence for differential expression frequency are spike-in genes.
The scater package also makes it easy to investigate correlations between genes and cells by supporting cell-cell and gene-gene distances in an SCESet object. Below we show how cell-cell distances can be defined and then how these distances can easily be used to produce multidimensional-scaling (MDS) plots to investigate the structure between cells.
cellDist(sce_zeisel_qc) <- (1 - cor(norm_exprs(sce_zeisel_qc),
method = "spearman")) / 2
plotMDS(sce_zeisel_qc, size_by = "total_features", colour_by = "level1class",
shape_by = "tissue") +
ggtitle("MDS plot based on spearman correlation")cellDist(sce_zeisel_qc) <- as.matrix(dist(t(counts(sce_zeisel_qc)),
method = "canberra"))
plotMDS(sce_zeisel_qc, size_by = "total_features", colour_by = "level1class",
shape_by = "tissue") +
ggtitle("MDS plot based on spearman correlation")With the plotMDS function the user can apply their favourite way of defining the distance between cells to then plot the cells in a low-dimensional space.
Finally, let us save the QC’d version of the data that can be used for any subsequent analyses.
save(sce_zeisel_qc, file = "zeisel_sce_qc.RData", compress = "xz",
compression_level = 9)Scater has been tested on Mac OS X and Linux environments and requires the R packages:
and imports the packages:
This case study was run using the following platform and R package versions:
## R Under development (unstable) (2016-01-24 r69993)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.3 (El Capitan)
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] qvalue_2.3.2 knitr_1.12.3 DT_0.1
## [4] cowplot_0.6.0 data.table_1.9.6 scater_0.99.1
## [7] ggplot2_2.0.0 Biobase_2.31.3 BiocGenerics_0.17.3
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-124 bitops_1.0-6 matrixStats_0.50.1
## [4] destiny_1.1.0 pbkrtest_0.4-6 tools_3.3.0
## [7] R6_2.1.2 DBI_0.3.1 lazyeval_0.1.10
## [10] mgcv_1.8-11 colorspace_1.2-6 nnet_7.3-12
## [13] sp_1.2-2 gridExtra_2.0.0 GGally_1.0.1
## [16] chron_2.3-47 sgeostat_1.0-27 quantreg_5.19
## [19] formatR_1.2.1 SparseM_1.7 sROC_0.1-2
## [22] labeling_0.3 scales_0.3.0 DEoptimR_1.0-4
## [25] lmtest_0.9-34 mvtnorm_1.0-5 robustbase_0.92-5
## [28] proxy_0.4-15 stringr_1.0.0 digest_0.6.9
## [31] minqa_1.2.4 rmarkdown_0.9.5 rrcov_1.3-8
## [34] htmltools_0.3 lme4_1.1-10 limma_3.27.11
## [37] highr_0.5.1 htmlwidgets_0.5 RSQLite_1.0.0
## [40] FNN_1.1 zoo_1.7-12 jsonlite_0.9.19
## [43] dplyr_0.4.3 car_2.1-1 RCurl_1.95-4.7
## [46] magrittr_1.5 Matrix_1.2-3 Rcpp_0.12.3
## [49] munsell_0.4.2 S4Vectors_0.9.28 viridis_0.3.2
## [52] scatterplot3d_0.3-36 stringi_1.0-1 yaml_2.1.13
## [55] edgeR_3.13.4 cvTools_0.3.2 MASS_7.3-45
## [58] zlibbioc_1.17.0 rhdf5_2.15.2 Rtsne_0.10
## [61] plyr_1.8.3 grid_3.3.0 pls_2.5-0
## [64] lattice_0.20-33 splines_3.3.0 igraph_1.0.1
## [67] rjson_0.2.15 reshape2_1.4.1 biomaRt_2.27.2
## [70] stats4_3.3.0 XML_3.98-1.3 evaluate_0.8
## [73] vcd_1.4-1 nloptr_1.0.4 MatrixModels_0.4-1
## [76] VIM_4.4.1 gtable_0.1.2 reshape_0.8.5
## [79] assertthat_0.1 RcppEigen_0.3.2.7.0 e1071_1.6-7
## [82] class_7.3-14 pcaPP_1.9-60 robCompositions_2.0.0
## [85] AnnotationDbi_1.33.7 IRanges_2.5.25 mvoutlier_2.0.6
## [88] cluster_2.0.3